High-dimensional Outlier Detection: the Subspace Method
نویسنده
چکیده
" In view of all that we have said in the foregoing sections, the many obstacles we appear to have surmounted, what casts the pall over our victory celebration? It is the curse of dimensionality, a malediction that has plagued the scientist from the earliest days. " – Richard Bellman 1. Introduction Many real data sets are very high dimensional. In some scenarios, real data sets may contain hundreds or thousands of dimensions. With increasing dimensionality, many of the conventional outlier detection methods do not work very effectively. This is an artifact of the well known curse of dimensionality. In high-dimensional space, the data becomes sparse, and the true outliers become masked by the noise effects of multiple dimensions, when analyzed in full dimensionality. A main cause of the dimensionality curse is the difficulty in defining locality for the high dimensional case. For example, proximity-based methods define locality with the use of distance functions. On the other hand, it has been shown in [65, 215], that all pairs of points are almost equidistant in high-dimensional space. This is referred to as data spar sity. Since outliers are defined as data points in sparse regions, this results in a poorly discriminative situation where all data points are sit uated in an almost equally sparse regions in full dimensionality. The challenges arising from the dimensionality curse are not specific to out lier detection. It is well known that many problems such as clustering and similarity search experience qualitative challenges with increasing
منابع مشابه
A Novel Subspace Outlier Detection Approach in High Dimensional Data Sets
Many real applications are required to detect outliers in high dimensional data sets. The major difficulty of mining outliers lies on the fact that outliers are often embedded in subspaces. No efficient methods are available in general for subspace-based outlier detection. Most existing subspacebased outlier detection methods identify outliers by searching for abnormal sparse density units in s...
متن کاملRobust Subspace Outlier Detection in High Dimensional Space
Rare data in a large-scale database are called outliers that reveal significant information in the real world. The subspace-based outlier detection is regarded as a feasible approach in very high dimensional space. However, the outliers found in subspaces are only part of the true outliers in high dimensional space, indeed. The outliers hidden in normalclustered points are sometimes neglected i...
متن کاملExample-Based DB-Outlier Detection from High Dimensional Datasets
Outlier detection is an important problem that has applications in many fields. High dimensional datasets are common in such applications. Among the existing outlier detection methods, Distance-Based outlier (DB-Outlier) detection is one of the most generalizable and simplest approaches. It finds outliers by calculating distances between data points. However, in high dimensional space, data dis...
متن کاملRandom Subspace Learning Approach to High-Dimensional Outliers Detection
We introduce and develop a novel approach to outlier detection based on adaptation of random subspace learning. Our proposed method handles both high-dimension low-sample size and traditional low-dimensional high-sample size datasets. Essentially, we avoid the computational bottleneck of techniques like Minimum Covariance Determinant (MCD) by computing the needed determinants and associated mea...
متن کاملA Robust Method for Detecting DB-Outliers from High Dimensional Datasets
Outlier detection is a popular technique that can be utilized in many modern applications like financial analysis and fraud detection. As data description becomes complex, operated datasets’ dimensionalities keep monotone increasing. However, current researches find that it is extremely difficult to pick out outliers directly from high dimensional datasets owing to the curse of dimensionality. ...
متن کاملDetecting High-Dimensional Outliers: the New Task, Algorithms and Performance
Outlier detection is a fundamental step in knowledge discovery in databases. With the increasing number of high-dimensional databases, existing outlier detection algorithms that work only in the context of full space are unable to effectively screen out informative outliers. This is because majority of these outliers exists only in subspaces. In this paper, we identify a new outlier detection t...
متن کامل